Code
# import libraries
import pandas as pd
import numpy as np
import os
import plotly.express as pxThis notebook is focused on the exploratory and descriptive analysis of the cleaned version of the Telecom Churn Rate Dataset.To begin with, we import all the necessary Python libraries required for data handling, analysis, and processing.
pandas for data manipulationnumpy for numerical operationsos for handling directory paths# import libraries
import pandas as pd
import numpy as np
import os
import plotly.express as pxWe define the folder structure to organize our project. This includes:
data/raw for raw data filesdata/processed for cleaned and transformed datasetsresults for any output files such as plots or modelsdocs for documentation or rendered outputs (e.g., HTML, PDF)We also ensure these directories are created if they don’t already exist.
Now we load the raw dataset (Telecom Churn Rate Dataset) from the raw directory using pandas.read_excel().
Missing values are identified using the placeholder ?, which we convert to NaN.
telecom_filename = os.path.join(processed_dir,'cleaned.xlsx')
telecom_df = pd.read_excel(telecom_filename)
telecom_df.head(10)| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | numAdminTickets | numTechTickets | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-vhveg | female | 0 | yes | no | 1 | no | no phone service | digital subscriber line | no | ... | no | no | month-to-month | yes | electronic check | 29.85 | 29.85 | 0 | 0 | no |
| 1 | 5575-gnvde | male | 0 | no | no | 34 | yes | no | digital subscriber line | yes | ... | no | no | one year | no | mailed check | 56.95 | 1889.50 | 0 | 0 | no |
| 2 | 3668-qpybk | male | 0 | no | no | 2 | yes | no | digital subscriber line | yes | ... | no | no | month-to-month | yes | mailed check | 53.85 | 108.15 | 0 | 0 | yes |
| 3 | 7795-cfocw | male | 0 | no | no | 45 | no | no phone service | digital subscriber line | yes | ... | no | no | one year | no | bank transfer (automatic) | 42.30 | 1840.75 | 0 | 3 | no |
| 4 | 9237-hqitu | female | 0 | no | no | 2 | yes | no | fiber optic | no | ... | no | no | month-to-month | yes | electronic check | 70.70 | 151.65 | 0 | 0 | yes |
| 5 | 9305-cdskc | female | 0 | no | no | 8 | yes | yes | fiber optic | no | ... | yes | yes | month-to-month | yes | electronic check | 99.65 | 820.50 | 0 | 0 | yes |
| 6 | 1452-kiovk | male | 0 | no | yes | 22 | yes | yes | fiber optic | no | ... | yes | no | month-to-month | yes | credit card (automatic) | 89.10 | 1949.40 | 0 | 0 | no |
| 7 | 6713-okomc | female | 0 | no | no | 10 | no | no phone service | digital subscriber line | yes | ... | no | no | month-to-month | no | mailed check | 29.75 | 301.90 | 0 | 0 | no |
| 8 | 7892-pookp | female | 0 | yes | no | 28 | yes | yes | fiber optic | no | ... | yes | yes | month-to-month | yes | electronic check | 104.80 | 3046.05 | 0 | 2 | yes |
| 9 | 6388-tabgu | male | 0 | no | yes | 62 | yes | no | digital subscriber line | yes | ... | no | no | one year | no | bank transfer (automatic) | 56.15 | 3487.95 | 0 | 0 | no |
10 rows × 23 columns
telecom_filename = os.path.join(processed_dir,'cleaned.xlsx')
telecom_df = pd.read_excel(telecom_filename)
telecom_df.head(10)We use .shape to see the number of rows and columns in the dataset. This helps us understand the scale of the data.
The .info() method gives a concise summary of the dataset, including:
This helps us assess the completeness of the data and identify potential issues.
summary_df = pd.DataFrame({
'Column': telecom_df.columns,
'Data Type':telecom_df.dtypes.values,
'Missing Values': telecom_df.isnull().sum().values
})
summary_df| Column | Data Type | Missing Values | |
|---|---|---|---|
| 0 | customerID | object | 0 |
| 1 | gender | object | 0 |
| 2 | SeniorCitizen | int64 | 0 |
| 3 | Partner | object | 0 |
| 4 | Dependents | object | 0 |
| 5 | tenure | int64 | 0 |
| 6 | PhoneService | object | 0 |
| 7 | MultipleLines | object | 0 |
| 8 | InternetService | object | 0 |
| 9 | OnlineSecurity | object | 0 |
| 10 | OnlineBackup | object | 0 |
| 11 | DeviceProtection | object | 0 |
| 12 | TechSupport | object | 0 |
| 13 | StreamingTV | object | 0 |
| 14 | StreamingMovies | object | 0 |
| 15 | Contract | object | 0 |
| 16 | PaperlessBilling | object | 0 |
| 17 | PaymentMethod | object | 0 |
| 18 | MonthlyCharges | float64 | 0 |
| 19 | TotalCharges | float64 | 0 |
| 20 | numAdminTickets | int64 | 0 |
| 21 | numTechTickets | int64 | 0 |
| 22 | Churn | object | 0 |
Numerical variables
telecom_df.describe()| SeniorCitizen | tenure | MonthlyCharges | TotalCharges | numAdminTickets | numTechTickets | |
|---|---|---|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 | 7043.000000 |
| mean | 0.162147 | 32.371149 | 64.761692 | 2279.734304 | 0.515689 | 0.419566 |
| std | 0.368612 | 24.559481 | 30.090047 | 2266.794470 | 1.275299 | 1.250117 |
| min | 0.000000 | 0.000000 | 18.250000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 9.000000 | 35.500000 | 398.550000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 29.000000 | 70.350000 | 1394.550000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 55.000000 | 89.850000 | 3786.600000 | 0.000000 | 0.000000 |
| max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 | 5.000000 | 9.000000 |
telecom_df.describe(include='object')| customerID | gender | Partner | Dependents | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 |
| unique | 7043 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 4 | 2 |
| top | 7590-vhveg | male | no | no | yes | no | fiber optic | no | no | no | no | no | no | month-to-month | yes | electronic check | no |
| freq | 1 | 3555 | 3641 | 4933 | 6361 | 3390 | 3096 | 3498 | 3088 | 3095 | 3473 | 2810 | 2785 | 3875 | 4171 | 2365 | 5174 |
telecom_df['gender'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')gender variable.
| unique values | proportion | |
|---|---|---|
| 0 | male | 0.504756 |
| 1 | female | 0.495244 |
telecom_df['Partner'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')partner variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.516967 |
| 1 | yes | 0.483033 |
telecom_df['Dependents'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')Dependents variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.700412 |
| 1 | yes | 0.299588 |
telecom_df['PhoneService'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')PhoneService variable.
| unique values | proportion | |
|---|---|---|
| 0 | yes | 0.903166 |
| 1 | no | 0.096834 |
telecom_df['MultipleLines'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')MultipleLines variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.481329 |
| 1 | yes | 0.421837 |
| 2 | no phone service | 0.096834 |
telecom_df['InternetService'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')InternetServices variable.
| unique values | proportion | |
|---|---|---|
| 0 | fiber optic | 0.439585 |
| 1 | digital subscriber line | 0.343746 |
| 2 | no | 0.216669 |
telecom_df['OnlineSecurity'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')OnlineSecurity variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.496663 |
| 1 | yes | 0.286668 |
| 2 | no internet service | 0.216669 |
telecom_df['OnlineBackup'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')OnlineBackup variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.438450 |
| 1 | yes | 0.344881 |
| 2 | no internet service | 0.216669 |
telecom_df['DeviceProtection'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')DeviceProtection variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.439443 |
| 1 | yes | 0.343888 |
| 2 | no internet service | 0.216669 |
telecom_df['TechSupport'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')TechSupport variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.493114 |
| 1 | yes | 0.290217 |
| 2 | no internet service | 0.216669 |
telecom_df['StreamingTV'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')StreamingTV variable.
| unique values | proportion | |
|---|---|---|
| 0 | no | 0.398978 |
| 1 | yes | 0.384353 |
| 2 | no internet service | 0.216669 |
telecom_df['Contract'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')Contract variable.
| unique values | proportion | |
|---|---|---|
| 0 | month-to-month | 0.550192 |
| 1 | two year | 0.240664 |
| 2 | one year | 0.209144 |
telecom_df['PaperlessBilling'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')PaperlessBilling variable.
| unique values | proportion | |
|---|---|---|
| 0 | yes | 0.592219 |
| 1 | no | 0.407781 |
telecom_df['PaymentMethod'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')PaymentMethod variable.
| unique values | proportion | |
|---|---|---|
| 0 | electronic check | 0.335794 |
| 1 | mailed check | 0.228880 |
| 2 | bank transfer (automatic) | 0.219225 |
| 3 | credit card (automatic) | 0.216101 |
In our dataset, there is a Churn column that indicates whether a customer has left the telecom service or not:
What is Churn Rate?
In the telecom industry, Churn Rate refers to the percentage of customers who stop using the company’s services during a specific period of time.
It is a key performance indicator (KPI) used to measure customer retention and loportant?
High churn rates can signal:
Reducing churn is crucial for maintaining revenue and customer base.
| not churned | churned | |
|---|---|---|
| 0 | no | 0.73463 |
| 1 | yes | 0.26537 |
After analyzing the dataset, we observe the following churn rate:
26.5% of customers have churned (Churn = Yes)
73.5% of customers have remained loyal (Churn = No)
Out of every 100 customers, about 27 leave the telecom service, while 73 continue using it.
A churn rate of 26.5% is relatively high in the telecom industry and could be a sign of underlying issues such as:
Understanding churn rate is essential because:
This insight sets the foundation for deeper analysis, such as: - Segmenting churned users - Identifying top predictors of churn - Designing customer retention strategies
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2785251253.py:50: DeprecationWarning:
Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2785251253.py:52: DeprecationWarning:
Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.
The dataset includes customer information classified by gender. Here’s the breakdown:
Female: 3,488 customers
Male: 3,555 customers
The gender distribution is nearly balanced, with 51.9% male and 48.1% female.
Since the difference is small, gender is unlikely to be a strong biasing factor in this dataset.
The dataset includes information about the preferred payment methods used by customers. The distribution is as follows:
telecom_df_payment_method = telecom_df['PaymentMethod'].value_counts(normalize=True).round(2).reset_index(name='payment_method_rate')
telecom_df_payment_method| PaymentMethod | payment_method_rate | |
|---|---|---|
| 0 | electronic check | 0.34 |
| 1 | mailed check | 0.23 |
| 2 | bank transfer (automatic) | 0.22 |
| 3 | credit card (automatic) | 0.22 |
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\751501223.py:48: DeprecationWarning:
Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\751501223.py:50: DeprecationWarning:
Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.
Understanding the distribution of payment methods is important for: - Customer segmentation - Identifying churn risks (e.g., manual payments may be associated with higher churn) - Designing targeted financial services or offers
The dataset tracks whether customers opted for paperless billing or not. The distribution is as follows:
telecom_df_billing= telecom_df['PaperlessBilling'].value_counts(normalize=True).reset_index(name='paperless_rate').round(2)
telecom_df_billing| PaperlessBilling | paperless_rate | |
|---|---|---|
| 0 | yes | 0.59 |
| 1 | no | 0.41 |
Understanding billing preferences can help the company: - Promote environmentally friendly practices - Reduce mailing costs
Customers in the dataset are categorized based on the type of contract they hold. Here’s the breakdown:
telecom_df_contract = telecom_df.groupby('Contract').size().reset_index(name='Number_of_contract')
telecom_df_contract['Number_of_contract'] = telecom_df_contract['Number_of_contract'].round(2)
telecom_df_contract| Contract | Number_of_contract | |
|---|---|---|
| 0 | month-to-month | 3875 |
| 1 | one year | 1473 |
| 2 | two year | 1695 |
In this analysis, we examined whether MonthlyCharges vary significantly among customers with different contract types. The three categories of contracts analyzed were Month-to-month, One year, and Two year. Understanding how monthly charges differ across these groups helps uncover potential pricing strategies and customer behavior patterns associated with each contract type.
Results Interpretation
| p-value | Significant? | Conclusion |
|---|---|---|
| < 0.05 | Yes | Reject H₀ – No Significant difference in MonthlyCharges across contract types |
| ≥ 0.05 | No | Fail to Reject H₀ – significant difference in MonthlyCharges across contract types |
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2235245424.py:16: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2235245424.py:29: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
Shapiro-Wilk Normality Test p-values:
Month-to-month: p = 0.0000
One year: p = 0.0000
Two year: p = 0.0000
Levene’s Test p-value: 0.0000
Using Welch’s ANOVA (unequal variances)
sum_sq df F PR(>F)
Contract 3.182336e+04 2.0 17.67295 2.207693e-08
Residual 6.338399e+06 7040.0 NaN NaN
ANOVA F-statistic: 17.6730
ANOVA p-value: 0.0000
Reject H₀: No Significant difference in MonthlyCharges across contract types.
The statistical analysis confirms that contract type significantly influences monthly charges. Customers with Month-to-month contracts pay more on average, which may be due to:
-Lack of discounts available in long-term plans, -Flexibility priced at a premium,
Telecom providers might use this insight to design targeted retention strategies—such as offering discounts or bundling services—to encourage month-to-month customers to switch to long-term contracts, potentially improving customer lifetime value.